First, we start loading the data

cols(
  id = col_double(),
  listing_url = col_character(),
  scrape_id = col_double(),
  last_scraped = col_date(format = ""),
  name = col_character(),
  description = col_character(),
  neighborhood_overview = col_character(),
  picture_url = col_character(),
  host_id = col_double(),
  host_url = col_character(),
  host_name = col_character(),
  host_since = col_date(format = ""),
  host_location = col_character(),
  host_about = col_character(),
  host_response_time = col_character(),
  host_response_rate = col_character(),
  host_acceptance_rate = col_character(),
  host_is_superhost = col_logical(),
  host_thumbnail_url = col_character(),
  host_picture_url = col_character(),
  host_neighbourhood = col_character(),
  host_listings_count = col_double(),
  host_total_listings_count = col_double(),
  host_verifications = col_character(),
  host_has_profile_pic = col_logical(),
  host_identity_verified = col_logical(),
  neighbourhood = col_character(),
  neighbourhood_cleansed = col_character(),
  neighbourhood_group_cleansed = col_character(),
  latitude = col_double(),
  longitude = col_double(),
  property_type = col_character(),
  room_type = col_character(),
  accommodates = col_double(),
  bathrooms = col_logical(),
  bathrooms_text = col_character(),
  bedrooms = col_double(),
  beds = col_double(),
  amenities = col_character(),
  price = col_character(),
  minimum_nights = col_double(),
  maximum_nights = col_double(),
  minimum_minimum_nights = col_double(),
  maximum_minimum_nights = col_double(),
  minimum_maximum_nights = col_double(),
  maximum_maximum_nights = col_double(),
  minimum_nights_avg_ntm = col_double(),
  maximum_nights_avg_ntm = col_double(),
  calendar_updated = col_logical(),
  has_availability = col_logical(),
  availability_30 = col_double(),
  availability_60 = col_double(),
  availability_90 = col_double(),
  availability_365 = col_double(),
  calendar_last_scraped = col_date(format = ""),
  number_of_reviews = col_double(),
  number_of_reviews_ltm = col_double(),
  number_of_reviews_l30d = col_double(),
  first_review = col_date(format = ""),
  last_review = col_date(format = ""),
  review_scores_rating = col_double(),
  review_scores_accuracy = col_double(),
  review_scores_cleanliness = col_double(),
  review_scores_checkin = col_double(),
  review_scores_communication = col_double(),
  review_scores_location = col_double(),
  review_scores_value = col_double(),
  license = col_character(),
  instant_bookable = col_logical(),
  calculated_host_listings_count = col_double(),
  calculated_host_listings_count_entire_homes = col_double(),
  calculated_host_listings_count_private_rooms = col_double(),
  calculated_host_listings_count_shared_rooms = col_double(),
  reviews_per_month = col_double()
)
  • after we had downloaded the raw data, there are 74 variables and 18,909 observations in total. There are 37 variables in numeric (e.g., id or scrape_id) and 25 variables in character (e.g., listing_url or name).

Then, we select some variables to our data frame, recalculate in proper way, make it ready to use and count the room available for each property type.

subset_listings <- listings %>% 
   dplyr::select(host_since, host_response_rate, host_is_superhost, host_listings_count, latitude, longitude, room_type, accommodates, bedrooms, beds,price, minimum_nights, maximum_nights, number_of_reviews, number_of_reviews_ltm, review_scores_rating, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, review_scores_accuracy, instant_bookable,property_type)

subset_listings$price = as.double(substr(subset_listings$price,2,nchar(subset_listings$price)))
subset_listings$host_since = as.numeric(subset_listings$host_since)
subset_listings$host_response_rate = as.numeric(sub("%","",subset_listings$host_response_rate))/100

subset_listings$host_is_superhost = as.integer(subset_listings$host_is_superhost)
subset_listings$instant_bookable = as.integer(subset_listings$instant_bookable)

k<-subset_listings %>% 
  count(property_type) %>% 
  arrange(desc(n))
head(k)
property_typen
Entire rental unit9404
Private room in rental unit5311
Private room in residential home816
Entire loft603
Entire condominium (condo)434
Private room in condominium (condo)343
subset_listings <- subset_listings %>%
  mutate(prop_type_simplified = case_when(
    property_type %in% c("Entire rental unit","Private room in rental unit","Private room in residential home","Entire loft") ~ property_type, 
    TRUE ~ "Other"
  )) %>% 
   dplyr::select(-property_type)

subset_listings_categorical <- subset_listings %>% 
   dplyr::select(host_is_superhost, room_type, accommodates, instant_bookable,prop_type_simplified)

subset_listings_continous <- subset_listings %>% 
   dplyr::select(host_response_rate,host_since,host_listings_count, latitude, longitude, accommodates, bedrooms, beds,price, minimum_nights, maximum_nights, number_of_reviews, number_of_reviews_ltm, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, review_scores_accuracy)

To create a full set of dummy variables,

dmy <- dummyVars(" ~ .", data = subset_listings)
subset_listings_finished <- data.frame(predict(dmy, newdata = subset_listings))

head(subset_listings_finished)
host_sincehost_response_ratehost_is_superhosthost_listings_countlatitudelongituderoom_typeEntire.home.aptroom_typeHotel.roomroom_typePrivate.roomroom_typeShared.roomaccommodatesbedroomsbedspriceminimum_nightsmaximum_nightsnumber_of_reviewsnumber_of_reviews_ltmreview_scores_ratingreview_scores_cleanlinessreview_scores_checkinreview_scores_communicationreview_scores_locationreview_scores_valuereview_scores_accuracyinstant_bookableprop_type_simplifiedEntire.loftprop_type_simplifiedEntire.rental.unitprop_type_simplifiedOtherprop_type_simplifiedPrivate.room.in.rental.unitprop_type_simplifiedPrivate.room.in.residential.home
1.44e+041   1140.5-3.6800102116011.12e+038044.874.814.8 4.894.774.854.91000010
1.47e+04   0240.4-3.74001011131440       3304.584.564.754.824.214.674.72000010
1.47e+040.9701040.4-3.7 10006355015730       00                     001000
1.84e+04   0140.4-3.711000319251.12e+031224.924.835   5   5   4.834.75001000
1.47e+041   0140.4-3.6900101112621.12e+0314904.684.924.784.714.7 4.724.8 000001
1.47e+040.9701040.4-3.691000433851365       634   4.2 4.8 4.4 4.2 4.4 4.2 001000

1 Exploratory Data Analysis (EDA)

Conduct a thorough EDA. Recall that an EDA involves three things:

  • Looking at the raw values.
    • dplyr::glimpse()
dplyr::glimpse(subset_listings)
Rows: 18,909
Columns: 24
$ host_since                  <dbl> 14350, 14661, 14657, 18397, 14726, 14657, ~
$ host_response_rate          <dbl> 1.00, NA, 0.97, NA, 1.00, 0.97, 0.25, NA, ~
$ host_is_superhost           <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, ~
$ host_listings_count         <dbl> 1, 2, 10, 1, 1, 10, 3, 4, 1, 1, 12, 3, 1, ~
$ latitude                    <dbl> 40.45724, 40.40381, 40.38840, 40.42183, 40~
$ longitude                   <dbl> -3.67688, -3.74130, -3.69511, -3.70529, -3~
$ room_type                   <chr> "Private room", "Private room", "Entire ho~
$ accommodates                <dbl> 2, 1, 6, 3, 1, 4, 2, 1, 2, 5, 4, 3, 3, 1, ~
$ bedrooms                    <dbl> 1, 1, 3, NA, 1, 3, 1, 1, 1, 2, 1, 1, 2, 1,~
$ beds                        <dbl> 1, 1, 5, 1, 1, 3, 2, 1, 1, 3, 2, 3, 2, 1, ~
$ price                       <dbl> 60, 31, 50, 92, 26, 85, 65, 15, 54, NA, 81~
$ minimum_nights              <dbl> 1, 4, 15, 5, 2, 1, 5, 1, 3, 3, 3, 5, 10, 1~
$ maximum_nights              <dbl> 1125, 40, 730, 1125, 1125, 365, 180, 1124,~
$ number_of_reviews           <dbl> 80, 33, 0, 12, 149, 6, 170, 6, 8, 0, 118, ~
$ number_of_reviews_ltm       <dbl> 4, 0, 0, 2, 0, 3, 0, 2, 0, 0, 0, 0, 0, 0, ~
$ review_scores_rating        <dbl> 4.87, 4.58, NA, 4.92, 4.68, 4.00, 4.64, 5.~
$ review_scores_cleanliness   <dbl> 4.81, 4.56, NA, 4.83, 4.92, 4.20, 4.89, 4.~
$ review_scores_checkin       <dbl> 4.80, 4.75, NA, 5.00, 4.78, 4.80, 4.84, 4.~
$ review_scores_communication <dbl> 4.89, 4.82, NA, 5.00, 4.71, 4.40, 4.80, 4.~
$ review_scores_location      <dbl> 4.77, 4.21, NA, 5.00, 4.70, 4.20, 4.90, 5.~
$ review_scores_value         <dbl> 4.85, 4.67, NA, 4.83, 4.72, 4.40, 4.71, 4.~
$ review_scores_accuracy      <dbl> 4.91, 4.72, NA, 4.75, 4.80, 4.20, 4.73, 4.~
$ instant_bookable            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ~
$ prop_type_simplified        <chr> "Private room in rental unit", "Private ro~
  • after we had selected the raw data to our dara frame, there are 24 variables and 18,909 observations in total. There are 22 variables in numeric (e.g., price or maximum_nights) and variables in character (e.g., room_type or prop_type_simplified).

  • Computing summary statistics of the variables of interest, or finding NAs

    • mosaic::favstats()
favstats(~review_scores_rating, data=subset_listings)
minQ1medianQ3maxmeansdnmissing
04.54.754.9454.550.783138775032
favstats(~price, data=subset_listings)
minQ1medianQ3maxmeansdnmissing
0366310299994.911418637272
  • skimr::skim()
skimr::skim(subset_listings)
Data summary
Name subset_listings
Number of rows 18909
Number of columns 24
_______________________
Column type frequency:
character 2
numeric 22
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
room_type 0 1 10 15 0 4 0
prop_type_simplified 0 1 5 32 0 5 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
host_since 26 1.00 17108.95 927.82 14290.00 16476.00 17086.00 17876.00 18879.00 <U+2581><U+2583><U+2587><U+2587><U+2586>
host_response_rate 6937 0.63 0.88 0.26 0.00 0.92 1.00 1.00 1.00 <U+2581><U+2581><U+2581><U+2581><U+2587>
host_is_superhost 26 1.00 0.18 0.39 0.00 0.00 0.00 0.00 1.00 <U+2587><U+2581><U+2581><U+2581><U+2582>
host_listings_count 26 1.00 11.61 32.34 0.00 1.00 2.00 6.00 515.00 <U+2587><U+2581><U+2581><U+2581><U+2581>
latitude 0 1.00 40.42 0.02 40.33 40.41 40.42 40.43 40.57 <U+2581><U+2587><U+2583><U+2581><U+2581>
longitude 0 1.00 -3.69 0.03 -3.89 -3.71 -3.70 -3.69 -3.55 <U+2581><U+2581><U+2587><U+2582><U+2581>
accommodates 0 1.00 3.09 1.96 0.00 2.00 2.00 4.00 16.00 <U+2587><U+2585><U+2581><U+2581><U+2581>
bedrooms 1434 0.92 1.44 0.83 1.00 1.00 1.00 2.00 18.00 <U+2587><U+2581><U+2581><U+2581><U+2581>
beds 341 0.98 1.92 1.45 0.00 1.00 1.00 2.00 24.00 <U+2587><U+2581><U+2581><U+2581><U+2581>
price 272 0.99 94.87 113.76 0.00 36.00 63.00 102.00 999.00 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_nights 0 1.00 7.30 36.01 1.00 1.00 2.00 3.00 1125.00 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_nights 0 1.00 1370.15 81133.97 1.00 90.00 1125.00 1125.00 11111111.00 <U+2587><U+2581><U+2581><U+2581><U+2581>
number_of_reviews 0 1.00 32.71 65.25 0.00 0.00 5.00 32.00 744.00 <U+2587><U+2581><U+2581><U+2581><U+2581>
number_of_reviews_ltm 0 1.00 3.97 9.94 0.00 0.00 0.00 3.00 256.00 <U+2587><U+2581><U+2581><U+2581><U+2581>
review_scores_rating 5032 0.73 4.55 0.78 0.00 4.50 4.75 4.94 5.00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_cleanliness 5269 0.72 4.66 0.51 1.00 4.56 4.80 4.98 5.00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_checkin 5269 0.72 4.78 0.43 1.00 4.75 4.90 5.00 5.00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_communication 5269 0.72 4.78 0.46 1.00 4.75 4.92 5.00 5.00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_location 5272 0.72 4.79 0.38 1.00 4.73 4.92 5.00 5.00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_value 5273 0.72 4.59 0.50 1.00 4.50 4.71 4.88 5.00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_accuracy 5271 0.72 4.72 0.47 1.00 4.67 4.86 5.00 5.00 <U+2581><U+2581><U+2581><U+2581><U+2587>
instant_bookable 0 1.00 0.49 0.50 0.00 0.00 0.00 1.00 1.00 <U+2587><U+2581><U+2581><U+2581><U+2587>
  • Creating informative visualizations.
    • ggplot2::ggplot()
      • geom_histogram() or geom_density() for numeric continuous variables
col_name = colnames(subset_listings_continous)
require(cowplot)
for(i in 1:19){
  assign(paste("g", i, sep = ""), i)    
}
k<-0
for (i in col_name){
  k<-k+1
  assign(paste("g", k, sep = ""), ggplot(subset_listings_continous, aes_string(x=i)) +
      geom_density(alpha=0.2)+
      theme_bw() +                #theme
      labs (
        title = paste("Density Plot for ", i),
        y     = "Density"         
      ))
}

plot_grid(g1,g2,g3,g4,g5,g6,nrow = 2)

plot_grid(g7,g8,g9,g10,g11,g12,nrow = 2)

plot_grid(g13,g14,g15,g16,g17,g18,g19,nrow = 3)

    * `geom_bar()` or `geom_col()` for categorical variables
    
col_name = colnames(subset_listings_categorical)
k<-0
for (i in col_name){
      k<-k+1
      assign(paste("g", k, sep = ""), 
                   ggplot(subset_listings_categorical, aes_string(x=i)) +
      geom_bar()+
      theme_bw() +                #theme
      theme(text = element_text(size=10),
             axis.text.x = element_text(angle=45, hjust=1)) +
      labs (
        title = paste("Barplot for ", i),
        y     = "Count"    
             
      ))
}

plot_grid(g1,g2,g4,nrow = 1)

plot_grid(g3,g5,nrow = 1)

* `GGally::ggpairs()` for scaterrlot/correlation matrix
# Scatterplots across all variables 

subset_listings_finished <-subset_listings_finished %>% 
  na.omit(subset_listings_finished) 
par(cex = 0.7)
corrplot(cor(subset_listings_continous,use="pairwise.complete.obs"),type="upper", order="hclust", tl.col="black", tl.srt=45, cl.cex = 1/par("cex"))

  • According to the scatter plot, the blue color means positive relationship of correlation, so that they move in the same direction while the red color means negative relationship of correlation, so that they move in opposite direction. The examples of positive correlation are beds and accommodates or review_score_value and review_score_accuracy. The example of negative correlation is host_since and number_of_review.

1.1 Data wrangling

glimpse the data

glimpse(listings)
Rows: 18,909
Columns: 74
$ id                                           <dbl> 6369, 21853, 23001, 24805~
$ listing_url                                  <chr> "https://www.airbnb.com/r~
$ scrape_id                                    <dbl> 2.021091e+13, 2.021091e+1~
$ last_scraped                                 <date> 2021-09-11, 2021-09-11, ~
$ name                                         <chr> "Rooftop terrace room ,  ~
$ description                                  <chr> "Excellent connection wit~
$ neighborhood_overview                        <chr> NA, "We live in a leafy n~
$ picture_url                                  <chr> "https://a0.muscache.com/~
$ host_id                                      <dbl> 13660, 83531, 82175, 3463~
$ host_url                                     <chr> "https://www.airbnb.com/u~
$ host_name                                    <chr> "Simon", "Abdel", "Jesus"~
$ host_since                                   <date> 2009-04-16, 2010-02-21, ~
$ host_location                                <chr> "Madrid, Community of Mad~
$ host_about                                   <chr> "Gay couple, heterofriend~
$ host_response_time                           <chr> "within a few hours", "N/~
$ host_response_rate                           <chr> "100%", "N/A", "97%", "N/~
$ host_acceptance_rate                         <chr> "83%", "N/A", "64%", "N/A~
$ host_is_superhost                            <lgl> TRUE, FALSE, FALSE, FALSE~
$ host_thumbnail_url                           <chr> "https://a0.muscache.com/~
$ host_picture_url                             <chr> "https://a0.muscache.com/~
$ host_neighbourhood                           <chr> "Hispanoamérica", "Aluche~
$ host_listings_count                          <dbl> 1, 2, 10, 1, 1, 10, 3, 4,~
$ host_total_listings_count                    <dbl> 1, 2, 10, 1, 1, 10, 3, 4,~
$ host_verifications                           <chr> "['email', 'phone', 'revi~
$ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ host_identity_verified                       <lgl> TRUE, TRUE, TRUE, FALSE, ~
$ neighbourhood                                <chr> NA, "Madrid, Spain", "Mad~
$ neighbourhood_cleansed                       <chr> "Hispanoamérica", "Cármen~
$ neighbourhood_group_cleansed                 <chr> "Chamartín", "Latina", "A~
$ latitude                                     <dbl> 40.45724, 40.40381, 40.38~
$ longitude                                    <dbl> -3.67688, -3.74130, -3.69~
$ property_type                                <chr> "Private room in rental u~
$ room_type                                    <chr> "Private room", "Private ~
$ accommodates                                 <dbl> 2, 1, 6, 3, 1, 4, 2, 1, 2~
$ bathrooms                                    <lgl> NA, NA, NA, NA, NA, NA, N~
$ bathrooms_text                               <chr> "1 shared bath", "1 bath"~
$ bedrooms                                     <dbl> 1, 1, 3, NA, 1, 3, 1, 1, ~
$ beds                                         <dbl> 1, 1, 5, 1, 1, 3, 2, 1, 1~
$ amenities                                    <chr> "[\"Wifi\", \"Shampoo\", ~
$ price                                        <chr> "$60.00", "$31.00", "$50.~
$ minimum_nights                               <dbl> 1, 4, 15, 5, 2, 1, 5, 1, ~
$ maximum_nights                               <dbl> 1125, 40, 730, 1125, 1125~
$ minimum_minimum_nights                       <dbl> 1, 4, 15, 5, 2, 1, 5, 1, ~
$ maximum_minimum_nights                       <dbl> 1, 4, 15, 5, 2, 1, 5, 1, ~
$ minimum_maximum_nights                       <dbl> 1125, 40, 730, 1125, 1125~
$ maximum_maximum_nights                       <dbl> 1125, 40, 730, 1125, 1125~
$ minimum_nights_avg_ntm                       <dbl> 1.0, 4.0, 15.0, 5.0, 2.0,~
$ maximum_nights_avg_ntm                       <dbl> 1125, 40, 730, 1125, 1125~
$ calendar_updated                             <lgl> NA, NA, NA, NA, NA, NA, N~
$ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ availability_30                              <dbl> 30, 29, 0, 1, 14, 0, 0, 0~
$ availability_60                              <dbl> 60, 59, 0, 1, 44, 0, 9, 1~
$ availability_90                              <dbl> 90, 89, 0, 25, 74, 0, 39,~
$ availability_365                             <dbl> 180, 364, 222, 115, 349, ~
$ calendar_last_scraped                        <date> 2021-09-11, 2021-09-11, ~
$ number_of_reviews                            <dbl> 80, 33, 0, 12, 149, 6, 17~
$ number_of_reviews_ltm                        <dbl> 4, 0, 0, 2, 0, 3, 0, 2, 0~
$ number_of_reviews_l30d                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ first_review                                 <date> 2016-03-31, 2014-10-10, ~
$ last_review                                  <date> 2019-05-14, 2018-05-29, ~
$ review_scores_rating                         <dbl> 4.87, 4.58, NA, 4.92, 4.6~
$ review_scores_accuracy                       <dbl> 4.91, 4.72, NA, 4.75, 4.8~
$ review_scores_cleanliness                    <dbl> 4.81, 4.56, NA, 4.83, 4.9~
$ review_scores_checkin                        <dbl> 4.80, 4.75, NA, 5.00, 4.7~
$ review_scores_communication                  <dbl> 4.89, 4.82, NA, 5.00, 4.7~
$ review_scores_location                       <dbl> 4.77, 4.21, NA, 5.00, 4.7~
$ review_scores_value                          <dbl> 4.85, 4.67, NA, 4.83, 4.7~
$ license                                      <chr> NA, NA, NA, NA, NA, NA, N~
$ instant_bookable                             <lgl> FALSE, FALSE, FALSE, FALS~
$ calculated_host_listings_count               <dbl> 2, 2, 5, 1, 1, 5, 3, 4, 1~
$ calculated_host_listings_count_entire_homes  <dbl> 0, 0, 4, 1, 0, 4, 3, 0, 1~
$ calculated_host_listings_count_private_rooms <dbl> 2, 2, 1, 0, 1, 1, 0, 4, 0~
$ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ reviews_per_month                            <dbl> 1.21, 0.39, NA, 0.51, 1.6~

To drop any non-numeric characters

listings <- listings %>% 
  mutate(price = parse_number(price))

typeof(listings$price)
[1] "double"

1.2 Propery types

Ranking the most popular of property types. Top 4 are accounting for 85% of total.

#count the top 4 most common property types
prop<-listings %>% 
  count(property_type) %>% 
  arrange(desc(n)) 

head(prop)
property_typen
Entire rental unit9404
Private room in rental unit5311
Private room in residential home816
Entire loft603
Entire condominium (condo)434
Private room in condominium (condo)343
#the proportion of the total listings the 4 most common property types make up
sum(prop$n[1:4])/sum(prop$n)
[1] 0.8532445
##the 4 most common property types make up 85.32%.

#create new variable: prop_type_simplified
listings <- listings %>%
  mutate(prop_type_simplified = case_when(
    property_type %in% c("Entire rental unit","Private room in rental unit","Private room in residential home","Entire loft") ~ property_type, 
    TRUE ~ "Other"
  ))

To check the variables in prop_type_simplified are correct,

k<-listings %>%
  count(property_type, prop_type_simplified) %>%
  arrange(desc(n)) 
head(k)
property_typeprop_type_simplifiedn
Entire rental unitEntire rental unit9404
Private room in rental unitPrivate room in rental unit5311
Private room in residential homePrivate room in residential home816
Entire loftEntire loft603
Entire condominium (condo)Other434
Private room in condominium (condo)Other343

Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels. We only want to include listings in our regression analysis that are intended for travel purposes:

  • What are the most common values for the variable minimum_nights?
k<-listings %>%
  count(minimum_nights) %>%
  arrange(desc(n)) 
head(k,)
minimum_nightsn
17560
24630
32673
4706
5598
7597
#1 and 2 are the most common values for the  variable `minimum_nights`
  • Is ther any value among the common values that stands out?

30 days stand out.

  • What is the likely intended purpose for Airbnb listings with this seemingly unusual value for minimum_nights?

This could be because someone wants to rent a house through Airbnb

Filter the airbnb data so that it only includes observations with minimum_nights <= 4

listings<-listings %>% 
  filter(minimum_nights <= 4)

2 Mapping

Visualisations of feature distributions and their relations are key to understanding a data set, and they can open up new lines of exploration. While we do not have time to go into all the wonderful geospatial visualisations one can do with R, you can use the following code to start with a map of your city, and overlay all AirBnB coordinates to get an overview of the spatial distribution of AirBnB rentals. For this visualisation we use the leaflet package, which includes a variety of tools for interactive maps, so you can easily zoom in-out, click on a point to get the actual AirBnB listing for that specific point, etc.

The following code, having downloaded a dataframe listings with all AirbnB listings in Milan, will plot on the map all AirBnBs where minimum_nights is less than equal to four (4). You could learn more about leaflet, by following the relevant Datacamp course on mapping with leaflet

leaflet(data = filter(listings, minimum_nights <= 4)) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   fillColor = "blue", 
                   fillOpacity = 0.4, 
                   popup = ~listing_url,
                   label = ~property_type)

3 Regression Analysis

For the target variable \(Y\), we will use the cost for two people to stay at an Airbnb location for four (4) nights.

Create a new variable called price_4_nights that uses price, and accomodates to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.

##listings cleaning: only include the data that is intended for travel purposes
data0<-listings %>% 
  filter(accommodates>1,minimum_nights<=4,maximum_nights>=4,price>0,availability_30>=4)%>% 
  mutate(price_4_nights=case_when(
    room_type=="Shared room"~price/accommodates*2*4,
    TRUE ~ price*4
  ))
data<-data0

#data cleaning

data<-data[complete.cases(data[ , c(36,37,38,61,62)]),]

data <- data %>% 
  mutate(bathrooms = parse_number(bathrooms_text))
data$host_has_profile_pic<-as.numeric(data$host_has_profile_pic)
data$host_identity_verified<-as.numeric(data$host_identity_verified)
data$host_is_superhost<-as.numeric(data$host_is_superhost)
data$has_availability<-as.numeric(data$has_availability)
data$instant_bookable<-as.numeric(data$instant_bookable)
data<-data %>% 
  mutate(weeks=difftime( "2021-9-30",data$host_since ,units = "weeks"))
data<- data %>% 
  mutate(weeks = parse_number(as.character(weeks)))
data<-data %>% 
  filter(price_4_nights>55) %>% 
  filter(price_4_nights<4000)

Use histograms or density plots to examine the distributions of price_4_nights and log(price_4_nights). Which variable should you use for the regression model? Why?

data %>% 
  ggplot(aes(x=price_4_nights))+
  geom_histogram()+
  theme_bw()+
#name the graph
   labs (
    title = "Distribution of Price",
    y = "count",
    x = "price_4_nights"
  )+
  NULL

data %>% 
  ggplot( aes(x=log(price_4_nights)))+
  geom_histogram()+
   theme_bw()+
  #name the graph
   labs (
    title = "Distribution of log(Price)",
    y = "count",
    x = "log(price_4_nights)"
  )+
  NULL

#almost normally distributed
  • We would suggest you to use the log regression model. Due to the fact that the linear regression model provides right skewed which is misleading. Therefore, in order to eliminated the problem, the log regression model will provide the normal distribution.

Fit a regression model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.

model1<-lm(log(price_4_nights)~prop_type_simplified+number_of_reviews+review_scores_rating,data)
summary(model1)

Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.68619 -0.36154 -0.05145  0.28801  2.56814 

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                           5.1371824  0.0914471
prop_type_simplifiedEntire rental unit                0.1905927  0.0479865
prop_type_simplifiedOther                            -0.0914510  0.0519197
prop_type_simplifiedPrivate room in rental unit      -0.7529075  0.0511307
prop_type_simplifiedPrivate room in residential home -0.7226708  0.0681376
number_of_reviews                                    -0.0012297  0.0001007
review_scores_rating                                  0.1603604  0.0168643
                                                     t value Pr(>|t|)    
(Intercept)                                           56.177  < 2e-16 ***
prop_type_simplifiedEntire rental unit                 3.972 7.24e-05 ***
prop_type_simplifiedOther                             -1.761   0.0782 .  
prop_type_simplifiedPrivate room in rental unit      -14.725  < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -10.606  < 2e-16 ***
number_of_reviews                                    -12.217  < 2e-16 ***
review_scores_rating                                   9.509  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5401 on 4509 degrees of freedom
Multiple R-squared:  0.3206,    Adjusted R-squared:  0.3197 
F-statistic: 354.7 on 6 and 4509 DF,  p-value: < 2.2e-16
  • Interpret the coefficient prop_type_simplifiedEntire rental unit in terms of price_4_nights. Holding the other variables, if the property is Entire rental instead of Entire loft , price_4_nights will increase by 20.9967% (because a logarithmic transformation is performed on ‘price_4_nights’ variable).

  • Interpret the coefficient prop_type_simplifiedOther in terms of price_4_nights. Holding the other variables, if the property is Other instead of Entire loft , price_4_nights will decrease by 9.5763% (because a logarithmic transformation is performed on ‘price_4_nights’ variable).

  • Interpret the coefficient prop_type_simplifiedPrivate room in rental unit in terms of price_4_nights. Holding the other variables, if the property is Private room in rental instead of Entire loft, price_4_nights will decrease by 112.3164% (because a logarithmic transformation is performed on ‘price_4_nights’ variable).

  • Interpret the coefficient prop_type_simplifiedPrivate room in residential home in terms of price_4_nights. Holding the other variables, if the property is Private room in residential home instead of Entire loft, price_4_nights will decrease by 105.9928% (because a logarithmic transformation is performed on ‘price_4_nights’ variable).

  • Interpret the coefficient of number_of_reviews in terms of price_4_nights. Holding the other variables, each unit increase in number_of_reviews will increase price_4_nights by 117.3934% (because a logarithmic transformation is performed on ‘price_4_nights’ variable).

  • Interpret the coefficient review_scores_rating in terms of price_4_nights. Holding the other variables, each unit increase in review_scores_rating will increase price_4_nights by 0.16925% (because a logarithmic transformation is performed on ‘price_4_nights’ variable).

We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. Fit a regression model called model2 that includes all of the explananatory variables in model1 plus room_type.

model2<-lm(log(price_4_nights)~prop_type_simplified+room_type+number_of_reviews+review_scores_rating,data)
summary(model2)

Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + room_type + 
    number_of_reviews + review_scores_rating, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.42481 -0.36193 -0.05797  0.27520  2.84489 

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                           5.171e+00  8.901e-02
prop_type_simplifiedEntire rental unit                1.905e-01  4.669e-02
prop_type_simplifiedOther                             1.977e-01  5.511e-02
prop_type_simplifiedPrivate room in rental unit      -1.287e-01  6.693e-02
prop_type_simplifiedPrivate room in residential home -9.894e-02  7.998e-02
room_typeHotel room                                  -1.805e-02  7.772e-02
room_typePrivate room                                -6.241e-01  4.475e-02
room_typeShared room                                 -1.205e+00  1.349e-01
number_of_reviews                                    -1.218e-03  9.795e-05
review_scores_rating                                  1.530e-01  1.642e-02
                                                     t value Pr(>|t|)    
(Intercept)                                           58.094  < 2e-16 ***
prop_type_simplifiedEntire rental unit                 4.079  4.6e-05 ***
prop_type_simplifiedOther                              3.587 0.000338 ***
prop_type_simplifiedPrivate room in rental unit       -1.922 0.054665 .  
prop_type_simplifiedPrivate room in residential home  -1.237 0.216153    
room_typeHotel room                                   -0.232 0.816404    
room_typePrivate room                                -13.945  < 2e-16 ***
room_typeShared room                                  -8.934  < 2e-16 ***
number_of_reviews                                    -12.431  < 2e-16 ***
review_scores_rating                                   9.318  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5254 on 4506 degrees of freedom
Multiple R-squared:  0.3574,    Adjusted R-squared:  0.3561 
F-statistic: 278.4 on 9 and 4506 DF,  p-value: < 2.2e-16
anova(model1,model2)
Res.DfRSSDfSum of SqFPr(>F)
4.51e+031.32e+03           
4.51e+031.24e+03371.185.85.42e-54
vif(model2)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 6.774042  4        1.270153
room_type            6.672320  3        1.372080
number_of_reviews    1.033018  1        1.016375
review_scores_rating 1.011501  1        1.005734
##However, there is multicollinearity

From the comparison, F = 85.812 with a p-value = 2.2e-16. Therefore, room_type is significant. However, there is multicollinearity

3.1 Further variables/questions to explore on our own

Our dataset has many more variables, so here are some ideas on how you can extend your analysis

  1. Are the number of bathrooms, bedrooms, beds, or size of the house (accomodates) significant predictors of price_4_nights? Or might these be co-linear variables?
##correlation between `bathrooms`, `bedrooms`, `beds`, or size of the house (`accomodates`)
corr<- data %>% 
  dplyr::select(bathrooms,bedrooms,beds,accommodates,number_of_reviews,review_scores_rating)

corr<-corr[complete.cases(corr),]
library(corrplot)
corrplot(corr = cor(corr),order ="AOE", addCoef.col="grey")

#`beds` ,`bedrooms`and `accomodates` are highly correlated

data3<-data %>% 
  dplyr::select(prop_type_simplified,bathrooms,bedrooms,beds,accommodates,number_of_reviews,review_scores_rating,price_4_nights)

model3<-lm(log(price_4_nights)~prop_type_simplified+bathrooms+bedrooms+beds+accommodates+number_of_reviews+review_scores_rating,data3)

summary(model3)

Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + bathrooms + 
    bedrooms + beds + accommodates + number_of_reviews + review_scores_rating, 
    data = data3)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.16083 -0.28587 -0.03412  0.25084  2.68955 

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                           4.630e+00  8.122e-02
prop_type_simplifiedEntire rental unit                6.255e-02  4.201e-02
prop_type_simplifiedOther                            -9.805e-02  4.546e-02
prop_type_simplifiedPrivate room in rental unit      -6.187e-01  4.509e-02
prop_type_simplifiedPrivate room in residential home -6.021e-01  5.998e-02
bathrooms                                             1.380e-01  1.361e-02
bedrooms                                              5.619e-02  1.246e-02
beds                                                  6.901e-03  8.688e-03
accommodates                                          8.175e-02  7.264e-03
number_of_reviews                                    -1.209e-03  8.789e-05
review_scores_rating                                  1.509e-01  1.472e-02
                                                     t value Pr(>|t|)    
(Intercept)                                           57.003  < 2e-16 ***
prop_type_simplifiedEntire rental unit                 1.489    0.137    
prop_type_simplifiedOther                             -2.157    0.031 *  
prop_type_simplifiedPrivate room in rental unit      -13.720  < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -10.039  < 2e-16 ***
bathrooms                                             10.138  < 2e-16 ***
bedrooms                                               4.509 6.66e-06 ***
beds                                                   0.794    0.427    
accommodates                                          11.255  < 2e-16 ***
number_of_reviews                                    -13.756  < 2e-16 ***
review_scores_rating                                  10.256  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4706 on 4505 degrees of freedom
Multiple R-squared:  0.4847,    Adjusted R-squared:  0.4835 
F-statistic: 423.7 on 10 and 4505 DF,  p-value: < 2.2e-16
#`bathrooms`, `bedrooms`,`accommodates` are significant
#select variables
step(model3)
Start:  AIC=-6797.36
log(price_4_nights) ~ prop_type_simplified + bathrooms + bedrooms + 
    beds + accommodates + number_of_reviews + review_scores_rating

                       Df Sum of Sq     RSS     AIC
- beds                  1     0.140  997.72 -6798.7
<none>                               997.59 -6797.4
- bedrooms              1     4.503 1002.09 -6779.0
- bathrooms             1    22.761 1020.35 -6697.5
- review_scores_rating  1    23.293 1020.88 -6695.1
- accommodates          1    28.050 1025.64 -6674.1
- number_of_reviews     1    41.901 1039.49 -6613.6
- prop_type_simplified  4   241.391 1238.98 -5826.7

Step:  AIC=-6798.73
log(price_4_nights) ~ prop_type_simplified + bathrooms + bedrooms + 
    accommodates + number_of_reviews + review_scores_rating

                       Df Sum of Sq     RSS     AIC
<none>                               997.72 -6798.7
- bedrooms              1     5.492 1003.22 -6775.9
- bathrooms             1    22.982 1020.71 -6697.9
- review_scores_rating  1    23.301 1021.03 -6696.5
- number_of_reviews     1    41.967 1039.69 -6614.7
- accommodates          1    51.635 1049.36 -6572.9
- prop_type_simplified  4   241.583 1239.31 -5827.5

Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + bathrooms + 
    bedrooms + accommodates + number_of_reviews + review_scores_rating, 
    data = data3)

Coefficients:
                                         (Intercept)  
                                             4.62539  
              prop_type_simplifiedEntire rental unit  
                                             0.06300  
                           prop_type_simplifiedOther  
                                            -0.09653  
     prop_type_simplifiedPrivate room in rental unit  
                                            -0.61712  
prop_type_simplifiedPrivate room in residential home  
                                            -0.60121  
                                           bathrooms  
                                             0.13854  
                                            bedrooms  
                                             0.05917  
                                        accommodates  
                                             0.08543  
                                   number_of_reviews  
                                            -0.00121  
                                review_scores_rating  
                                             0.15097  
model3<-lm(log(price_4_nights) ~ prop_type_simplified + bathrooms + 
    bedrooms + accommodates + number_of_reviews + review_scores_rating, 
    data = data3)
summary(model3)

Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + bathrooms + 
    bedrooms + accommodates + number_of_reviews + review_scores_rating, 
    data = data3)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.10477 -0.28674 -0.03398  0.25114  2.68858 

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                           4.625e+00  8.101e-02
prop_type_simplifiedEntire rental unit                6.300e-02  4.200e-02
prop_type_simplifiedOther                            -9.653e-02  4.541e-02
prop_type_simplifiedPrivate room in rental unit      -6.171e-01  4.505e-02
prop_type_simplifiedPrivate room in residential home -6.012e-01  5.996e-02
bathrooms                                             1.385e-01  1.360e-02
bedrooms                                              5.917e-02  1.188e-02
accommodates                                          8.543e-02  5.595e-03
number_of_reviews                                    -1.210e-03  8.788e-05
review_scores_rating                                  1.510e-01  1.472e-02
                                                     t value Pr(>|t|)    
(Intercept)                                           57.095  < 2e-16 ***
prop_type_simplifiedEntire rental unit                 1.500   0.1337    
prop_type_simplifiedOther                             -2.126   0.0336 *  
prop_type_simplifiedPrivate room in rental unit      -13.699  < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -10.026  < 2e-16 ***
bathrooms                                             10.188  < 2e-16 ***
bedrooms                                               4.980 6.59e-07 ***
accommodates                                          15.271  < 2e-16 ***
number_of_reviews                                    -13.767  < 2e-16 ***
review_scores_rating                                  10.258  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4706 on 4506 degrees of freedom
Multiple R-squared:  0.4846,    Adjusted R-squared:  0.4836 
F-statistic: 470.7 on 9 and 4506 DF,  p-value: < 2.2e-16
vif(model3)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.311597  4        1.034487
bathrooms            1.592060  1        1.261769
bedrooms             2.556479  1        1.598899
accommodates         2.786001  1        1.669132
number_of_reviews    1.036908  1        1.018287
review_scores_rating 1.013512  1        1.006733
#No multicollinearity
  • According to the model comparison table, all explainable variables (x) are statistically significant to explain price_4_nights (y) including the new variables that we had added (bathrooms, bedrooms, beds and size of the house). To test the co-linear, we use vif function which show no multi-collinearity in this regression model.
  1. Do superhosts (host_is_superhost) command a pricing premium, after controlling for other variables?
data4<-data %>% 
  dplyr::select(prop_type_simplified,host_is_superhost,bathrooms,bedrooms,accommodates,number_of_reviews,review_scores_rating,price_4_nights)
model4<-lm(log(price_4_nights)~.,data4)

summary(model4)

Call:
lm(formula = log(price_4_nights) ~ ., data = data4)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.10195 -0.28855 -0.03563  0.25080  2.69026 

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                           4.6707780  0.0825791
prop_type_simplifiedEntire rental unit                0.0586328  0.0420022
prop_type_simplifiedOther                            -0.0981776  0.0453828
prop_type_simplifiedPrivate room in rental unit      -0.6193546  0.0450233
prop_type_simplifiedPrivate room in residential home -0.5991017  0.0599223
host_is_superhost                                     0.0466049  0.0167494
bathrooms                                             0.1377639  0.0135915
bedrooms                                              0.0596835  0.0118733
accommodates                                          0.0852729  0.0055908
number_of_reviews                                    -0.0012426  0.0000886
review_scores_rating                                  0.1397705  0.0152464
                                                     t value Pr(>|t|)    
(Intercept)                                           56.561  < 2e-16 ***
prop_type_simplifiedEntire rental unit                 1.396  0.16280    
prop_type_simplifiedOther                             -2.163  0.03057 *  
prop_type_simplifiedPrivate room in rental unit      -13.756  < 2e-16 ***
prop_type_simplifiedPrivate room in residential home  -9.998  < 2e-16 ***
host_is_superhost                                      2.782  0.00542 ** 
bathrooms                                             10.136  < 2e-16 ***
bedrooms                                               5.027 5.18e-07 ***
accommodates                                          15.252  < 2e-16 ***
number_of_reviews                                    -14.025  < 2e-16 ***
review_scores_rating                                   9.167  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4702 on 4505 degrees of freedom
Multiple R-squared:  0.4855,    Adjusted R-squared:  0.4843 
F-statistic: 425.1 on 10 and 4505 DF,  p-value: < 2.2e-16
vif(model4)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.318734  4        1.035189
host_is_superhost    1.110902  1        1.053993
bathrooms            1.592735  1        1.262036
bedrooms             2.557100  1        1.599093
accommodates         2.786301  1        1.669222
number_of_reviews    1.055401  1        1.027327
review_scores_rating 1.089416  1        1.043751

superhosts does command a pricing premium, after controlling for other variables

  1. Some hosts allow you to immediately book their listing (instant_bookable == TRUE), while a non-trivial proportion don’t. After controlling for other variables, is instant_bookable a significant predictor of price_4_nights?
data5<-data %>% 
  dplyr::select(prop_type_simplified,instant_bookable,host_is_superhost,bathrooms,bedrooms,accommodates,number_of_reviews,review_scores_rating,price_4_nights)

model5<-lm(log(price_4_nights)~.,data5)
summary(model5)

Call:
lm(formula = log(price_4_nights) ~ ., data = data5)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.09429 -0.28770 -0.03699  0.24981  2.68232 

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                           4.645e+00  8.382e-02
prop_type_simplifiedEntire rental unit                5.605e-02  4.202e-02
prop_type_simplifiedOther                            -1.013e-01  4.541e-02
prop_type_simplifiedPrivate room in rental unit      -6.160e-01  4.505e-02
prop_type_simplifiedPrivate room in residential home -5.969e-01  5.992e-02
instant_bookable                                      2.586e-02  1.464e-02
host_is_superhost                                     4.660e-02  1.675e-02
bathrooms                                             1.383e-01  1.359e-02
bedrooms                                              6.079e-02  1.189e-02
accommodates                                          8.473e-02  5.598e-03
number_of_reviews                                    -1.249e-03  8.866e-05
review_scores_rating                                  1.423e-01  1.531e-02
                                                     t value Pr(>|t|)    
(Intercept)                                           55.419  < 2e-16 ***
prop_type_simplifiedEntire rental unit                 1.334  0.18225    
prop_type_simplifiedOther                             -2.230  0.02577 *  
prop_type_simplifiedPrivate room in rental unit      -13.674  < 2e-16 ***
prop_type_simplifiedPrivate room in residential home  -9.962  < 2e-16 ***
instant_bookable                                       1.766  0.07742 .  
host_is_superhost                                      2.783  0.00541 ** 
bathrooms                                             10.174  < 2e-16 ***
bedrooms                                               5.114 3.28e-07 ***
accommodates                                          15.135  < 2e-16 ***
number_of_reviews                                    -14.090  < 2e-16 ***
review_scores_rating                                   9.295  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4701 on 4504 degrees of freedom
Multiple R-squared:  0.4858,    Adjusted R-squared:  0.4846 
F-statistic: 386.9 on 11 and 4504 DF,  p-value: < 2.2e-16
vif(model5)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.357186  4        1.038915
instant_bookable     1.056433  1        1.027829
host_is_superhost    1.110902  1        1.053993
bathrooms            1.593469  1        1.262327
bedrooms             2.564278  1        1.601336
accommodates         2.794873  1        1.671787
number_of_reviews    1.057321  1        1.028261
review_scores_rating 1.099239  1        1.048446

instant_bookable does command a pricing premium, after controlling for other variables.

  1. For all cities, there are 3 variables that relate to neighbourhoods: neighbourhood, neighbourhood_cleansed, and neighbourhood_group_cleansed. There are typically more than 20 neighbourhoods in each city, and it wouldn’t make sense to include them all in your model. Use your city knowledge, or ask someone with city knowledge, and see whether you can group neighbourhoods together so the majority of listings falls in fewer (5-6 max) geographical areas. You would thus need to create a new categorical variabale neighbourhood_simplified and determine whether location is a predictor of price_4_nights
data %>%
  count(neighbourhood_group_cleansed) %>%
  arrange(desc(n)) 
neighbourhood_group_cleansedn
Centro2488
Salamanca271
Chamberí231
Arganzuela213
Tetuán165
Puente de Vallecas139
Hortaleza121
Retiro117
Carabanchel107
Chamartín99
Ciudad Lineal94
Latina86
Moncloa - Aravaca78
San Blas - Canillejas78
Usera69
Fuencarral - El Pardo49
Barajas37
Villa de Vallecas26
Villaverde22
Moratalaz17
Vicálvaro9
#there are 21 districts in Madrid,divide them into five districts according to the number of houses and the distance from the Centro
data<-data %>%
  mutate(neighbourhood_simplified=
         case_when(
           neighbourhood_group_cleansed=="Centro"~"loop1",
           neighbourhood_group_cleansed %in% c("Salamanca","Chamberí","Arganzuela") ~ "loop2",
           neighbourhood_group_cleansed %in% c("Tetuán","Retiro","Puente de Vallecas","Chamartín") ~ "loop3",
           neighbourhood_group_cleansed %in% c("Carabanchel","Ciudad Lineal","Moncloa - Aravaca","Latina","San Blas - Canillejas","Hortaleza","Usera") ~ "loop4",
           TRUE ~ "loop5"
         ))

data6<-data %>% 
    dplyr::select(prop_type_simplified,neighbourhood_simplified,instant_bookable,host_is_superhost,bathrooms,bedrooms,accommodates,number_of_reviews,review_scores_rating,price_4_nights)

model6<-lm(log(price_4_nights)~.,data6)
summary(model6)

Call:
lm(formula = log(price_4_nights) ~ ., data = data6)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.00336 -0.27735 -0.03766  0.23279  2.85879 

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                           4.727e+00  8.278e-02
prop_type_simplifiedEntire rental unit                7.568e-03  4.138e-02
prop_type_simplifiedOther                            -1.097e-01  4.457e-02
prop_type_simplifiedPrivate room in rental unit      -6.163e-01  4.424e-02
prop_type_simplifiedPrivate room in residential home -5.678e-01  5.889e-02
neighbourhood_simplifiedloop2                         1.684e-02  1.981e-02
neighbourhood_simplifiedloop3                        -1.140e-01  2.273e-02
neighbourhood_simplifiedloop4                        -2.511e-01  2.139e-02
neighbourhood_simplifiedloop5                        -2.255e-01  3.836e-02
instant_bookable                                      9.458e-03  1.444e-02
host_is_superhost                                     4.800e-02  1.644e-02
bathrooms                                             1.296e-01  1.336e-02
bedrooms                                              6.938e-02  1.169e-02
accommodates                                          8.282e-02  5.503e-03
number_of_reviews                                    -1.366e-03  8.819e-05
review_scores_rating                                  1.477e-01  1.503e-02
                                                     t value Pr(>|t|)    
(Intercept)                                           57.102  < 2e-16 ***
prop_type_simplifiedEntire rental unit                 0.183  0.85491    
prop_type_simplifiedOther                             -2.462  0.01386 *  
prop_type_simplifiedPrivate room in rental unit      -13.929  < 2e-16 ***
prop_type_simplifiedPrivate room in residential home  -9.642  < 2e-16 ***
neighbourhood_simplifiedloop2                          0.850  0.39523    
neighbourhood_simplifiedloop3                         -5.017 5.45e-07 ***
neighbourhood_simplifiedloop4                        -11.736  < 2e-16 ***
neighbourhood_simplifiedloop5                         -5.880 4.41e-09 ***
instant_bookable                                       0.655  0.51237    
host_is_superhost                                      2.920  0.00352 ** 
bathrooms                                              9.702  < 2e-16 ***
bedrooms                                               5.934 3.18e-09 ***
accommodates                                          15.051  < 2e-16 ***
number_of_reviews                                    -15.491  < 2e-16 ***
review_scores_rating                                   9.828  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4612 on 4500 degrees of freedom
Multiple R-squared:  0.5055,    Adjusted R-squared:  0.5039 
F-statistic: 306.7 on 15 and 4500 DF,  p-value: < 2.2e-16
vif(model6)
                             GVIF Df GVIF^(1/(2*Df))
prop_type_simplified     1.441599  4        1.046780
neighbourhood_simplified 1.152689  4        1.017921
instant_bookable         1.067086  1        1.032999
host_is_superhost        1.112446  1        1.054726
bathrooms                1.599750  1        1.264812
bedrooms                 2.576954  1        1.605289
accommodates             2.806030  1        1.675121
number_of_reviews        1.086918  1        1.042553
review_scores_rating     1.100501  1        1.049048
anova(model5,model6)
Res.DfRSSDfSum of SqFPr(>F)
4.5e+03995           
4.5e+03957438.244.85.56e-37
##F=44.845 with p< 2.2e-16, the location is significant
  1. What is the effect of avalability_30 or reviews_per_month on price_4_nights, after we control for other variables?
data7<-data %>% 
    dplyr::select(prop_type_simplified,neighbourhood_simplified,instant_bookable,availability_30,host_is_superhost,bathrooms,bedrooms,accommodates,review_scores_rating,reviews_per_month,price_4_nights)
model7<-lm(log(price_4_nights)~.,data7)
summary(model7)

Call:
lm(formula = log(price_4_nights) ~ ., data = data7)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.08748 -0.25785 -0.03334  0.22036  2.63836 

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                           4.4281155  0.0798698
prop_type_simplifiedEntire rental unit               -0.0112065  0.0386681
prop_type_simplifiedOther                            -0.1942975  0.0417525
prop_type_simplifiedPrivate room in rental unit      -0.7457868  0.0416723
prop_type_simplifiedPrivate room in residential home -0.7275543  0.0554074
neighbourhood_simplifiedloop2                         0.0202601  0.0185070
neighbourhood_simplifiedloop3                        -0.1387502  0.0212437
neighbourhood_simplifiedloop4                        -0.2929612  0.0200530
neighbourhood_simplifiedloop5                        -0.2613945  0.0358827
instant_bookable                                      0.0628427  0.0136325
availability_30                                       0.0177101  0.0008806
host_is_superhost                                     0.0975186  0.0154819
bathrooms                                             0.1356782  0.0124748
bedrooms                                              0.0720092  0.0109294
accommodates                                          0.0788687  0.0051457
review_scores_rating                                  0.1676647  0.0140711
reviews_per_month                                    -0.0762672  0.0043085
                                                     t value Pr(>|t|)    
(Intercept)                                           55.442  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -0.290    0.772    
prop_type_simplifiedOther                             -4.654 3.36e-06 ***
prop_type_simplifiedPrivate room in rental unit      -17.896  < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -13.131  < 2e-16 ***
neighbourhood_simplifiedloop2                          1.095    0.274    
neighbourhood_simplifiedloop3                         -6.531 7.24e-11 ***
neighbourhood_simplifiedloop4                        -14.609  < 2e-16 ***
neighbourhood_simplifiedloop5                         -7.285 3.79e-13 ***
instant_bookable                                       4.610 4.14e-06 ***
availability_30                                       20.112  < 2e-16 ***
host_is_superhost                                      6.299 3.28e-10 ***
bathrooms                                             10.876  < 2e-16 ***
bedrooms                                               6.589 4.95e-11 ***
accommodates                                          15.327  < 2e-16 ***
review_scores_rating                                  11.916  < 2e-16 ***
reviews_per_month                                    -17.701  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4311 on 4499 degrees of freedom
Multiple R-squared:  0.568, Adjusted R-squared:  0.5665 
F-statistic: 369.8 on 16 and 4499 DF,  p-value: < 2.2e-16
vif(model7)
                             GVIF Df GVIF^(1/(2*Df))
prop_type_simplified     1.536214  4        1.055131
neighbourhood_simplified 1.159326  4        1.018652
instant_bookable         1.089101  1        1.043600
availability_30          1.255944  1        1.120689
host_is_superhost        1.129026  1        1.062556
bathrooms                1.596089  1        1.263364
bedrooms                 2.577386  1        1.605424
accommodates             2.807799  1        1.675649
review_scores_rating     1.103813  1        1.050625
reviews_per_month        1.158354  1        1.076269
  • Both availability_30 and reviews_per_month are statistically significant to the regression model. For availability_30, holding the other variables, each unit increase in availability_30 will increase price_4_nights by 0.0176054% (because a logarithmic transformation is performed on ‘price_4_nights’ variable). For reviews_per_month,holding the other variables, each unit increase in reviews_per_month will decrease price_4_nights by 0.0695514% (because a logarithmic transformation is performed on ‘price_4_nights’ variable).

3.2 Diagnostics, collinearity, summary tables

  1. Check the residuals, using autoplot(model_x)
library(ggfortify)
autoplot(model7)

  1. As you start building models with more explanatory variables, make sure you use `car::vif(model_x)`` to calculate the Variance Inflation Factor (VIF) for your predictors and determine whether you have colinear variables. A general guideline is that a VIF larger than 5 or 10 is large, and your model may suffer from collinearity. Remove the variable in question and run your model again without it.
car::vif(model7)
                             GVIF Df GVIF^(1/(2*Df))
prop_type_simplified     1.536214  4        1.055131
neighbourhood_simplified 1.159326  4        1.018652
instant_bookable         1.089101  1        1.043600
availability_30          1.255944  1        1.120689
host_is_superhost        1.129026  1        1.062556
bathrooms                1.596089  1        1.263364
bedrooms                 2.577386  1        1.605424
accommodates             2.807799  1        1.675649
review_scores_rating     1.103813  1        1.050625
reviews_per_month        1.158354  1        1.076269
  • According ti above table, GVIF value of all variables are less than 5, therefore we can conclude that there is no collinearity in this regreesion models.
  1. Create a summary table, using huxtable that shows which models you worked on, which predictors are significant, the adjusted \(R^2\), and the Residual Standard Error.
library(huxtable)

huxreg(list("model1" = model1, "model2" = model2, "model3" = model3, "model4" = model4,"model5"=model5,"model6"=model6,"model7"=model7))
model1model2model3model4model5model6model7
(Intercept)5.137 ***5.171 ***4.625 ***4.671 ***4.645 ***4.727 ***4.428 ***
(0.091)   (0.089)   (0.081)   (0.083)   (0.084)   (0.083)   (0.080)   
prop_type_simplifiedEntire rental unit0.191 ***0.190 ***0.063    0.059    0.056    0.008    -0.011    
(0.048)   (0.047)   (0.042)   (0.042)   (0.042)   (0.041)   (0.039)   
prop_type_simplifiedOther-0.091    0.198 ***-0.097 *  -0.098 *  -0.101 *  -0.110 *  -0.194 ***
(0.052)   (0.055)   (0.045)   (0.045)   (0.045)   (0.045)   (0.042)   
prop_type_simplifiedPrivate room in rental unit-0.753 ***-0.129    -0.617 ***-0.619 ***-0.616 ***-0.616 ***-0.746 ***
(0.051)   (0.067)   (0.045)   (0.045)   (0.045)   (0.044)   (0.042)   
prop_type_simplifiedPrivate room in residential home-0.723 ***-0.099    -0.601 ***-0.599 ***-0.597 ***-0.568 ***-0.728 ***
(0.068)   (0.080)   (0.060)   (0.060)   (0.060)   (0.059)   (0.055)   
number_of_reviews-0.001 ***-0.001 ***-0.001 ***-0.001 ***-0.001 ***-0.001 ***        
(0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)           
review_scores_rating0.160 ***0.153 ***0.151 ***0.140 ***0.142 ***0.148 ***0.168 ***
(0.017)   (0.016)   (0.015)   (0.015)   (0.015)   (0.015)   (0.014)   
room_typeHotel room        -0.018                                            
        (0.078)                                           
room_typePrivate room        -0.624 ***                                        
        (0.045)                                           
room_typeShared room        -1.205 ***                                        
        (0.135)                                           
bathrooms                0.139 ***0.138 ***0.138 ***0.130 ***0.136 ***
                (0.014)   (0.014)   (0.014)   (0.013)   (0.012)   
bedrooms                0.059 ***0.060 ***0.061 ***0.069 ***0.072 ***
                (0.012)   (0.012)   (0.012)   (0.012)   (0.011)   
accommodates                0.085 ***0.085 ***0.085 ***0.083 ***0.079 ***
                (0.006)   (0.006)   (0.006)   (0.006)   (0.005)   
host_is_superhost                        0.047 ** 0.047 ** 0.048 ** 0.098 ***
                        (0.017)   (0.017)   (0.016)   (0.015)   
instant_bookable                                0.026    0.009    0.063 ***
                                (0.015)   (0.014)   (0.014)   
neighbourhood_simplifiedloop2                                        0.017    0.020    
                                        (0.020)   (0.019)   
neighbourhood_simplifiedloop3                                        -0.114 ***-0.139 ***
                                        (0.023)   (0.021)   
neighbourhood_simplifiedloop4                                        -0.251 ***-0.293 ***
                                        (0.021)   (0.020)   
neighbourhood_simplifiedloop5                                        -0.226 ***-0.261 ***
                                        (0.038)   (0.036)   
availability_30                                                0.018 ***
                                                (0.001)   
reviews_per_month                                                -0.076 ***
                                                (0.004)   
N4516        4516        4516        4516        4516        4516        4516        
R20.321    0.357    0.485    0.485    0.486    0.506    0.568    
logLik-3622.171    -3496.718    -2998.562    -2994.685    -2993.121    -2904.859    -2599.746    
AIC7260.343    7015.436    6019.124    6013.370    6012.243    5843.719    5235.491    
*** p < 0.001; ** p < 0.01; * p < 0.05.
  • Since R-squared represents a goodness-of-fit measure for linear regression models, meaning that higher R-squared is better. As a result, we would suggest you to use the model7 that has the highest value of R-squared at 0.568, so that 56.8% of the data fit the regression model.
  • Based on model7, the interception is statistically significant to explain this regression model whereas there are 14 variables (prop_type_simplifiedOther, prop_type_simplifiedPrivate room in rental unit, prop_type_simplifiedPrivate room in residential home, bathrooms, bedrooms, accommodates, host_is_superhost, instant_bookable, neighbourhood_simplifiedloop3, neighbourhood_simplifiedloop4, neighbourhood_simplifiedloop5, availability_30, reviews_per_month) are statistically significant to explain this regression model
  • and another 3 variables (prop_type_simplifiedEntire rental unit, number_of_reviews, neighbourhood_simplifiedloop2) are not statistically significant to explain this regression model.
  • The Residual Standard Error for the models are: 0.5401, 0.5254, 0.4706, 0.4702,0.4701,0.4612, 0.4311 seperately, showing that the model7 has the smallest RSE.
  1. Finally, you must use the best model you came up with for prediction. Suppose you are planning to visit the city you have been assigned to over reading week, and you want to stay in an Airbnb. Find Airbnb’s in your destination city that are apartments with a private room, have at least 10 reviews, and an average rating of at least 90. Use your best model to predict the total cost to stay at this Airbnb for 4 nights. Include the appropriate 95% interval with your prediction. Report the point prediction and interval in terms of price_4_nights.
#####Model test
library(rsample)
set.seed(1234)
train_test_split<-initial_split(data7,prop=0.75)
data_train<-training(train_test_split)
data_test<-testing(train_test_split)

rmse_train<-data_train %>% 
  mutate(predictions=predict(model7,.)) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()
rmse_train
[1] 23052.75
rmse_test<-data_test %>% 
  mutate(predictions = predict(model7,.)) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()
rmse_test
[1] 13863.82
summary(model7)

Call:
lm(formula = log(price_4_nights) ~ ., data = data7)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.08748 -0.25785 -0.03334  0.22036  2.63836 

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                           4.4281155  0.0798698
prop_type_simplifiedEntire rental unit               -0.0112065  0.0386681
prop_type_simplifiedOther                            -0.1942975  0.0417525
prop_type_simplifiedPrivate room in rental unit      -0.7457868  0.0416723
prop_type_simplifiedPrivate room in residential home -0.7275543  0.0554074
neighbourhood_simplifiedloop2                         0.0202601  0.0185070
neighbourhood_simplifiedloop3                        -0.1387502  0.0212437
neighbourhood_simplifiedloop4                        -0.2929612  0.0200530
neighbourhood_simplifiedloop5                        -0.2613945  0.0358827
instant_bookable                                      0.0628427  0.0136325
availability_30                                       0.0177101  0.0008806
host_is_superhost                                     0.0975186  0.0154819
bathrooms                                             0.1356782  0.0124748
bedrooms                                              0.0720092  0.0109294
accommodates                                          0.0788687  0.0051457
review_scores_rating                                  0.1676647  0.0140711
reviews_per_month                                    -0.0762672  0.0043085
                                                     t value Pr(>|t|)    
(Intercept)                                           55.442  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -0.290    0.772    
prop_type_simplifiedOther                             -4.654 3.36e-06 ***
prop_type_simplifiedPrivate room in rental unit      -17.896  < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -13.131  < 2e-16 ***
neighbourhood_simplifiedloop2                          1.095    0.274    
neighbourhood_simplifiedloop3                         -6.531 7.24e-11 ***
neighbourhood_simplifiedloop4                        -14.609  < 2e-16 ***
neighbourhood_simplifiedloop5                         -7.285 3.79e-13 ***
instant_bookable                                       4.610 4.14e-06 ***
availability_30                                       20.112  < 2e-16 ***
host_is_superhost                                      6.299 3.28e-10 ***
bathrooms                                             10.876  < 2e-16 ***
bedrooms                                               6.589 4.95e-11 ***
accommodates                                          15.327  < 2e-16 ***
review_scores_rating                                  11.916  < 2e-16 ***
reviews_per_month                                    -17.701  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4311 on 4499 degrees of freedom
Multiple R-squared:  0.568, Adjusted R-squared:  0.5665 
F-statistic: 369.8 on 16 and 4499 DF,  p-value: < 2.2e-16
new_data<-data.frame("Private room in rental unit","loop2",1,30,1,1,2,2,4.6,3)
names(new_data)<-c("prop_type_simplified","neighbourhood_simplified",'instant_bookable','availability_30','host_is_superhost','bathrooms','bedrooms','accommodates','review_scores_rating','reviews_per_month')
price_4_nights_pred<-exp(predict(model7,new_data,interval = "confidence"))
price_4_nights_pred
       fit      lwr     upr
1 215.7548 202.3355 230.064

point prediction is 215.7548 interval is:[202.34, 230.06]

4 Deliverables

4.1 Executive Summary

Our best model is model7 as 56.8% of the data fit this regression model which represent the highest value of r-squared. There are 14 variables (X) in total that are able to explain in the price for 4 night for 2 people (Y). The most 2 important factors that cost the higher price of hotel in Madrid, Spain are the review score rating and bathroom (number of available and cleaness). However, property types and locations have a negative impact on price for 4 night for 2 people. For example, the higher distance from central will cost the hotel for 4 nights for 2 propel to be lower such as Latine, San Blas - Canillejas or Hortaleza.

4.2 Data Exploration and Feature Selection

After we had imported the raw data, we select only the important variables that will have an impact on our analysis. Then, we declare the variable types; double, character or logic, for example.

To make the property types more clarify, we classified it into 4 main groups which are 1) Entire rental unit 2) Private room in rental unit 3) Private room in residential home and 4) Entire loft and the remaining are Other.

To see the big picture of our data frame including some statistics, we use glimpse(), favstats() and skim()

4.3 Model Selection and Validation

First, we imported the raw data and clean it to more organized. Then, we ensure data to meet the assumption; normal distribution by using log rather than linear line. After that, we perform the regression analysis, as we add more variables in the regression model, the r-squared keep improving and those variables are statistically significant to explain Y. Once we got the regression models, then we ran the diagnosis to check for homoscedasticity and made the conclusion in comparison table. In order to select the best model, R-squared represents a goodness-of-fit measure for linear regression models, meaning that higher R-squared is better. As a result, we would suggest you to use the model7 that has the highest value of R-squared at 0.568, so that 56.8% of the data fit the regression model.

4.4 Findings and Recommendations

Focusing on significant variables,ceteris paribus, an increase 1) review_scores_rating 2) bathrooms 3) bedrooms 4) accommodates 5) host_is_superhost 6) instant_bookable and 7) availability_30 will increase the price for 4 nights.

Focusing on significant variables, ceteris paribus, an increase in 1) prop_type_simplifiedOther 2) prop_type_simplifiedPrivate room in rental unit 3)prop_type_simplifiedPrivate room in residential home 4)neighbourhood_simplifiedloop3 5) neighbourhood_simplifiedloop4 6) neighbourhood_simplifiedloop5 and 7) reviews_per_month will decrease the price for 4 nights.

Even though the models are acceptable but the residuals are not normally distributed, so it obey the basic assumptions of OLS regression. Therefore, to improve the analysis, the next step is to modify the residuals to make it normally distributed. To do so, we can use BOX-COC method.

5 Rubric

Your work will be assessed on a rubric which you can find here

6 Acknowledgements